ABSTRACT

This document presents standards for educational 
accountability systems that represent models of practice derived from three 
perspectives: research knowledge, practical experience, and ethical 
considerations. These standards should be conceived of as targets for state 
and local systems and as criteria to judge proposed models of accountability 
development. The 22 standards outlined in this report are grouped into these 
categories: (1) standards on system components; (2) testing standards; (3) 
stakes; (4) public reporting formats; and (5) evaluation. (SLD) 



Standards for Educational 
Accountability Systems 

CRESST Co-Directors Eva L. Baker, Robert L. Linn, 
Joan L. Herman, and 
CRESST Associate Director Daniel Koretz 



The Standards for Educational Accountability Systems is a collaborative project between CRESST and the Consortium for 
Policy Research in Education (CPRE). The standards reflect input from Eva L. Baker, Robert L. Linn, Joan L. Herman, 
Daniel Koretz, and Richard Elmore, as well as reviewers from professional organizations, educational institutions, and 
commercial test producers. The Standards will appear in a forthcoming book edited by Susan Fuhrman and Richard Elmore, 
Redesigning Accountability Systems (New York: Teachers College Press). 



The passage of the education reform law has 
spotlighted testing and accountability once 
again. Provisions to test students in Grades 3-8, 
to develop approaches for measuring adequate yearly 
progress, and to reach full proficiency in 12 years are 
among the salient features of the law that states will be- 
gin to address. While the details of 
implementation remain to be worked 
out, it is clear that all states will now 
review the present form of their testing 
programs and accountability systems to 
determine how they will be changed to 
meet these new expectations. Now is 
the time for states, as they reflect and prepare for action, 
to consider anew the true quality of their future efforts. 
What gauge should be used to determine the quality of 
accountability plans and operations? 

We believe that research, development, and evalua- 
tion knowledge can assist states in sorting through their 
options and in improving quality. CRESST, in partnership 
with the Consortium for Policy Research in Education 
(CPRE), with the Education Commission of the States 
(ECS), and with advice and review from numerous col- 
leagues in research and practice, offers the Standards for 
Educational Accountability Systems. These standards are 
intended to provide guidance to states and districts in con- 
ducting self-reviews of their own systems and to delin- 
eate criteria by which developing accountability systems 
can be judged. The Standards for Educational Account- 
ability Systems represent compiled knowledge developed 
from sources including the Standards for Educational and 
Psychological Testing (AERA, APA, & NCME, 1999), re- 
search findings on testing and accountability systems, and 
studies of best practices. The Standards for Educational 
Accountability Systems also stress the importance of un- 
derstandable description of account- 
ability systems and clear reporting of 
results. 

Because experience with account- 
ability systems is still developing, the 
standards we propose are intended to 
help evaluate existing systems and to 
guide the design of improved procedures. The standards 
strongly endorse each state's responsibility to conduct con- 
tinuing evaluation of its own accountability system. It is 
not possible at this stage in the development of account- 
ability systems to know in advance how every element of 
an accountability system will actually operate in practice 
or what effects it will produce. Evaluations, conducted in- 
house or by universities, external organizations, or teams 
of experts, are essential if states are to learn system- 
atically from one another and if the nation is to judge the 
effectiveness of its efforts for children. Evaluation results 
will be essential to the continuing improvement of testing 
programs and accountability provisions. 

In sum, the standards offered below represent mod- 
els of practice derived from three perspectives: research 
knowledge, practical experience, and ethical consider- 
ations. They should be conceived of as targets for state and 
local systems and as criteria to judge proposed models of 
accountability development. 

It should be understood that tests included in an account- 
ability system should meet the Standards for Educational and 
Psychological Testing (AERA, APA, & NCME, 1999). What we 
have highlighted here are criteria that apply especially to ac- 
countability systems. It is likely also that additional standards 
will be subsequently developed based on evaluations of ac- 
countability system effects. 

A. Standards on System Components 

1. Accountability expectations should be made 
public and understandable for all partici- 
pants in the system. 

Comment: Explicit information about ex- 
pectations is a prerequisite for partici- 
pants to perceive the accountability sys- 
tem as fair. It is also needed to allow par- 
ticipants to meet expectations and to 
monitor their progress. 

2. Accountability systems should employ different types 
of data from multiple sources. 

Comment: Although measures of student 
achievement may be of primary interest for 
accountability purposes, it is important also 
to obtain information about student and 
teacher characteristics to provide context for 
interpreting student achievement. It also is 
important to consider other student outcome 
data such as attendance, mobility, and rates 
of retention in grade, dropout and graduation. 
Moreover, it is important to obtain data on in- 
structional resources and curriculum materi- 
als, and about the degree to which students 
are provided with adequate opportunity to 
learn the content specified in content stan- 
dards and curriculum materials. 

3. Accountability systems should include data elements 
that allow for interpretations of student, institution, 
and administrative performance. 

Comment: Students, teachers, administrators, 
and policymakers have a shared responsibil- 
ity for achieving the results expected by ac- 
countability systems. The system needs to 
provide the information for each of these par- 
ties to know what actions need to be taken. 







4. Accountability systems should in- 
clude the performance of all students, 
including subgroups that historically 
have been difficult to assess. 

Comment: Previous practices that 
excluded many students from test- 
ing because of absence on the day 
of test administration, limited En- 
glish proficiency, or student dis- 
abilities gave a distorted and usu- 
ally exaggerated view of overall 
performance. They also precluded 
accountability for the performance 

of excluded students. Legal 
requirements as well as ethi- 
cal considerations demand 
that all students be included 
in the accountability sys- 
tem. Many students who 
would have been excluded 
in the past can be included without any alter- 
ations in the test or administration conditions. 

Some accommodations in administration con- 
ditions will be required for other students, and 
for some students the test will need to be modi- 
fied, or alternative assessments used, in order 
for the students to be included in the account- 
ability system. No student should be left out of 
the system, however. 

5. The weighting of elements in the system, including dif- 
ferent types of test content, and different information 
sources, should be made explicit. 

Comment: Making sense of overall accountabil- 
ity indices requires an understanding not only 
of the elements that go into the index, but of 
the weights that are assigned to each element. 

It is informative to provide not only the weights 
that are assigned to the different elements by 
policy, but also information about how much 
each element affects the overall index. The re- 
lationship of an element to a weighted account- 
ability index depends on the variability of the 
element across institutions as well as the weight 
assigned to the element by policy. 
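The distinction drawn in this comment, between the weight assigned by policy and the influence an element actually has on the index, can be illustrated with a minimal sketch. It assumes a simple linear index and uses weight times standard deviation across institutions as a rough measure of influence, ignoring correlations among elements. The element names, weights, and school values are hypothetical and are not drawn from any actual accountability system.

```python
from statistics import pstdev


def effective_weights(policy_weights, element_values):
    """Approximate each element's share of the spread in a weighted index.

    Assumption for illustration: influence is taken to be
    policy weight * standard deviation across institutions,
    ignoring correlations between elements.
    """
    raw = {
        name: policy_weights[name] * pstdev(values)
        for name, values in element_values.items()
    }
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Hypothetical data: four schools, three elements on a 0-100 scale.
policy_weights = {"reading": 0.4, "math": 0.4, "attendance": 0.2}
element_values = {
    "reading": [50, 60, 70, 80],     # varies widely across schools
    "math": [55, 60, 65, 70],        # varies moderately
    "attendance": [94, 95, 95, 96],  # varies very little
}

shares = effective_weights(policy_weights, element_values)
for name, share in shares.items():
    print(f"{name}: policy weight {policy_weights[name]:.2f}, "
          f"approximate effective share {share:.2f}")
```

In this invented example, reading and math carry the same policy weight, yet reading accounts for roughly twice as much of the spread in the index because it varies more across schools, and attendance contributes far less than its nominal 20 percent.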

6. Rules for determining adequate progress of schools and 
individuals should be developed to avoid erroneous 
judgments attributable to fluctuations of the student 
population or errors in measurement. 

Comment: Progress based on student test av- 
erages reflecting performance of the total or 
subgroup is usually not regular because of 
changes in school populations, measurement 
error, and other situational factors. Approaches 
that capture the longitudinal performance of 
individuals (along with an indicator of 
the proportion such longitudinal data 
represent) can help minimize inappro- 
priate inferences. Other strategies in- 
clude using more than one year's dif- 
ference to compute growth. 
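One possible reading of "using more than one year's difference" is to compare pooled means over adjacent multi-year windows rather than the two most recent years alone. The sketch below illustrates that reading only; the function names, the two-year window, and the school means are hypothetical, and other pooling rules are equally plausible.

```python
def one_year_change(scores):
    """Change between the two most recent yearly means."""
    return scores[-1] - scores[-2]


def multi_year_change(scores, span=2):
    """Change between the mean of the last `span` years and the mean of
    the `span` years immediately before them (an assumed pooling rule)."""
    recent = scores[-span:]
    earlier = scores[-2 * span:-span]
    return sum(recent) / span - sum(earlier) / span


# Hypothetical school means for four consecutive years; the cohort
# tested changes each year, so the yearly mean fluctuates.
school_means = [60.0, 66.0, 61.0, 67.0]

single = one_year_change(school_means)
pooled = multi_year_change(school_means)
print(f"one-year change: {single:+.1f}")
print(f"two-year pooled change: {pooled:+.1f}")
```

With these invented numbers the single-year comparison suggests a six-point gain, while the pooled comparison suggests a one-point gain, which shows how a judgment of adequate progress can hinge on the rule chosen.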

B. Testing Standards 

7. Decisions about individual students 
should not be made on the basis of a 
single test. 

Comment: There are several reasons for this 
standard. First, no test is perfectly reliable. 
There is always a degree of uncertainty associ- 
ated with any test score. That uncertainty needs 
to be taken into account when making deci- 
sions about individual students. Second, all 
tests have less than perfect validity. Hence, it 
is important to look for other information that 
will either support or disconfirm the informa- 
tion provided by a single test score. The im- 
portance of obtaining other information to con- 
firm or disconfirm the information provided 
by a single test score increases as the impor- 
tance of the decision and the stakes associated 
with it increases. Yet another reason for mul- 
tiple sources of information is the limitation of 
a single measure as a sample of the domain(s) 
of interest. 

8. Multiple test forms should be used when there are re- 
peated administrations of an assessment. 

Comment: The items contained on a test form 
are only a sample of the domain that the 
test is intended to measure. Learning the 
answers to the items on a single form 
by focusing exclusively on those items 
is not the same as learning the material 
for the domain of content the test is 
intended to measure. Consequently, it 
is important to evaluate the 
generalizability of performance by 
administering a different form when a 
test is administered for a second or third 
time. 

9. The validity of measures that have been administered 
as part of an accountability system should be docu- 
mented for the various purposes of the system. 

Comment: Validity is dependent on the specific 
uses and interpretations of test scores. It is in- 
appropriate to assume that a test that is valid 
when used for one pur- 
pose will also be valid for 
other uses or interpreta- 
tions. Hence, validity 
needs to be specifically 
evaluated and docu- 
mented for each purpose. 

10. If tests are to help improve 
system performance, there 
should be information provided 
to document that test results are modifiable by quality 
instruction and student effort. 

Comment: Tests need to be sensitive to differ- 
ences in instructional quality and student ef- 
fort in order to be useful as tools in improving 
system performance. Sensitivity to instruction 
and to student effort is also a prerequisite for 
fairness if educators and students are to be held 
accountable for results. 

11. If test data are used as a basis of rewards or sanctions, 
evidence of technical quality of the measures and error 
rates associated with misclassification of individuals or 
institutions should be published. 

Comment: Because tests are fallible measures, 
classification errors are inevitable when tests 
are used to classify students or institutions into 
categories associated with rewards or sanctions. 

In order to judge whether the risk of errors is 
acceptably low, it is essential that information 
be provided about the probability of mis- 
classifications of various kinds. 
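As a rough illustration of what "the probability of misclassifications" can mean for a single student, the sketch below assumes that observed scores are normally distributed around a student's true score with a known standard error of measurement (SEM). That error model, the function name, and the numbers are assumptions made for illustration; an operational testing program would estimate error rates from its own data and psychometric model.

```python
from statistics import NormalDist


def misclassification_probability(true_score, cut_score, sem):
    """Probability that the observed score lands on the wrong side of the cut.

    Assumes observed = true + normally distributed error with standard
    deviation `sem` (a simplifying assumption, not a program's actual model).
    """
    observed = NormalDist(mu=true_score, sigma=sem)
    p_below_cut = observed.cdf(cut_score)
    if true_score >= cut_score:
        return p_below_cut        # truly at or above the cut, observed below it
    return 1.0 - p_below_cut      # truly below the cut, observed at or above it


# Hypothetical values: cut score of 50 and SEM of 4 scale points.
for true_score in (45, 48, 52, 60):
    p = misclassification_probability(true_score, cut_score=50, sem=4)
    print(f"true score {true_score}: chance of misclassification {p:.2f}")
```

Under these assumptions, students whose true scores lie close to the cut face a substantial chance of being placed in the wrong category, and the chance shrinks quickly as the true score moves away from the cut.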

12. Evidence of test validity for students with different lan- 
guage backgrounds should be made publicly available. 

Comment: Validity needs to be assessed sepa- 
rately for students with different language 
backgrounds. Whether a test is administered 
in English or in a student's primary language, 
validity of the test for students of different lan- 
guage backgrounds cannot be assumed from 
evidence based only on test results of students 
whose first language is English. Testing stu- 
dents in their primary language may be re- 
quired for some students. However, translation 
and adaptation of tests to different languages 
is a complex undertaking. There are many 
threats to validity of tests administered in dif- 
ferent languages. Lack of consistency between 
the language of the test and the language of 
instruction is one of the major threats to valid- 
ity and needs to be evaluated to the extent fea- 
sible. 

13. Evidence of test validity for children with disabilities 
should be made publicly available. 

Comment: Accommodations may be needed 
for some students with disabilities to be able 
to participate in testing in a 
meaningful way. The goal of ac- 
commodations is to remove 
sources of difficulty that are irrel- 
evant to the intent of the mea- 
surement. That is, an accommo- 
dation should make it possible 
for a student with disabilities to 
demonstrate her knowledge and 
skills in the content domain being tested so that 
the score reflects that knowledge and skill 
rather than the student's disability. The accom- 
modation should level the playing field; it is 
not intended to give the student with a disabil- 
ity an advantage over other students. The vali- 
dation task is to provide evidence that the test 
is reflecting the student's knowledge and skills 
and not her specific disability. For students with 
severe disabilities, assessments may need to be 
modified, or alternative assessments may need 
to be selected or developed, possibly designed 
to assess different learning goals than those of 
the assessments used for the majority of stu- 
dents. Evidence regarding the validity of inter- 
pretations made from modified or alternative 
assessments should be provided to the extent 
feasible. 

14. If tests are claimed to measure content and performance 
standards, analyses should document the relationship 
between the items and specific standards or sets of stan- 
dards. 

Comment: The degree of 
alignment of a test with con- 
tent standards may be evalu- 
ated, for example, by provid- 
ing a mapping of the test 
specifications to the content 
standards. Such a mapping 
can reveal areas of the content 
standards that are not in- 
cluded in the test specifications as well as ar- 
eas that are lightly or heavily sampled in the 
test specifications. The mapping may also re- 
veal areas tested that are not part of the con- 
tent standards. Performance standards gener- 
ally provide verbal descriptions of performance 
levels that are considered satisfactory or exem- 
plary. The degree to which the descriptions map 
directly to the test items and the correspon- 
dence of the performance standards to the cut 
scores on the test need to be documented and 
evaluated. 
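The kind of mapping described in this comment can be sketched as a simple tally of items per content standard. The standard codes and item assignments below are hypothetical, and the sketch covers only the content side of alignment. It does not address performance standards or cut scores.

```python
from collections import Counter


def alignment_report(content_standards, item_to_standard):
    """Summarize how test items map onto a list of content standards."""
    counts = Counter(item_to_standard.values())
    return {
        # Content standards with no items at all.
        "untested": sorted(s for s in content_standards if counts[s] == 0),
        # Codes that items point to but that are not in the content standards.
        "off_standard": sorted(s for s in counts if s not in content_standards),
        # How lightly or heavily each content standard is sampled.
        "items_per_standard": {s: counts[s] for s in content_standards},
    }


# Hypothetical content standards and item assignments.
content_standards = ["S1", "S2", "S3"]
item_to_standard = {
    "item1": "S1",
    "item2": "S1",
    "item3": "S1",
    "item4": "S2",
    "item5": "X9",  # measures something outside the content standards
}

report = alignment_report(content_standards, item_to_standard)
print(report)
```

In this invented example the report shows one standard heavily sampled, one lightly sampled, one not tested at all, and one item that measures content outside the standards.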



C. Stakes 

15. Stakes for accountability systems 
should apply to adults and students and 
should be coordinated to support sys- 
tem goals. 

Comment: Asymmetry in stakes 
may have undesirable consequences, both per- 
ceived and real. For example, if teachers and 
administrators are held accountable for student 
achievement but students are not, then there 
are likely to be concerns about the degree to 
which students put forth their best effort in tak- 
ing the tests. Conversely, it may be unfair to 
hold students accountable for performance on 
a test without having some assurance that 
teachers and other adults are being held ac- 
countable for providing students with adequate 
opportunity to learn the material that is tested. 
Incentives and sanctions that push in opposite 
directions for adults and for students can be 
counterproductive. They need to be consistent 
with each other and with the goals of the sys- 
tem. 

16. Appeal procedures should be available to contest re- 
wards and sanctions. 

Comment: Extenuating circumstances may call 
the validity of results into question. For ex- 
ample, a disturbance during test administra- 
tion may invalidate the test results. Also, indi- 
viduals may have information that leads to con- 
flicting conclusions about performance. Appeal 
procedures allow for such additional informa- 
tion to be brought to bear on a decision and 
thereby enhance its validity. 

17. Stakes for results and their phase-in schedule should 
be made explicit at the outset of the implementation of 
the system. 



Comment: Making plans for phasing in stakes 
for results is part of making accountability ex- 
pectations explicit to participants. Explication 
of plans allows participants to make informed 
decisions about how best to achieve the ends 
expected by the accountability system. 

18. Accountability systems should begin with broad, dif- 
fuse stakes and move to specific consequences for indi- 
viduals and institutions as the sys- 
tem aligns. 

Comment: Starting with broad, 
diffuse stakes (e.g., public report- 
ing of aggregate achievement re- 
sults for schools) allows partici- 
pants time to make the changes 
needed to meet expectations be- 
fore being confronted with spe- 
cific rewards or sanctions for performance (e.g., 
monetary rewards to schools or teachers, 
graduation requirements for students). Ad- 
vance warning and phasing-in of stakes en- 
hances both the perception of fairness and the 
actual fairness of the accountability system. 



D. Public Reporting Formats 

19. System results should be made broadly available to the 
press, with sufficient time for reasonable analysis and 
with clear explanations of legitimate and potential ille- 
gitimate interpretations of results. 

Comment: The press plays an important role 
in the interpretation of the results produced by 
accountability systems. Legitimate interpreta- 
tions of results require an understanding of 
what goes into them and some of their techni- 
cal characteristics. Those responsible for the ac- 
countability system also have a responsibility 
to help ensure proper interpretation of the re- 
sults and to minimize inappropriate interpre- 
tations to the extent possible. Efforts to assist 
the press in understanding the results, their 
strengths and limitations, and the legitimate 
and illegitimate interpretations can pay consid- 
erable dividends in improved coverage by the 
press and better understanding by the public. 

20. Reports to districts and schools should promote appro- 
priate interpretations and use of results by including 
multiple indicators of performance, error estimates and 
performance by subgroups. 

Comment: Interpretations of results can be en- 
riched by the reporting of consistencies and in- 
consistencies provided by multiple indicators 
of performance. Performance by subgroups 
needs to be considered to ensure that overall 
results do not conceal great disparities in sub- 
group performance. Understanding the degree 
of uncertainty in results can reduce the likeli- 
hood of misinterpretation and en- 
hance the likelihood of appropri- 
ate use of results. 

E. Evaluation 

21. Longitudinal studies should be 
planned, implemented, and reported 
evaluating effects of the accountability 
program. Minimally, questions should determine the 
degree to which the system 

a. builds capacity of staff; 

b. affects resource allocation; 

c. supports high-quality instruction; 

d. promotes student equity access to 
education; 

e. minimizes corruption; 

f. affects teacher quality, recruitment, 
and retention; and 

g. produces unanticipated outcomes. 

Comment: The primary purpose of educational 
accountability systems is to improve instruc- 
tion and student learning. The overarching 
evaluation question is the degree to which the 
intended benefits are realized and the costs in 
terms of unintended negative consequences are 
minimized. Listed items (a) through (d) reflect 
intended positive consequences the realization 
of which is the focus of evaluation. Items (e) 
and (g) emphasize the needed evaluation of 
plausible unintended 
negative consequences. Item (f) requires the 
evaluation of both in- 
tended positive and unin- 
tended negative influ- 
ences of the accountabil- 
ity system. 

22. The validity of test-based inferences should be subject 
to ongoing evaluation. In particular, evaluation should 
address 

a. aggregate gains in performance 
over time; and 

b. impact on identifiable student 
and personnel groups. 

Comment: Gains in performance may be spuri- 
ous or real. Evaluation of the gains may be aided 
by investigations of the degree to which gains 
on the measures used by the accountability sys- 
tem are reflected in changes on alternative indi- 
cators of performance obtained from other tests 
or more general indicators, such as performance 
beyond school in college or the workplace. Dif- 
ferential effects on identifiable student or per- 
sonnel groups may lead to different conclusions 
than those that are supported by the overall ag- 
gregate performance. 
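One way to carry out the comparison described in this comment is to put the gain on the accountability test and the gain on an alternative indicator on a common scale. The sketch below uses baseline standard deviation units for that purpose. The scores, the choice of scale, and the idea of a single summary ratio are assumptions for illustration, not a prescribed evaluation method.

```python
from statistics import mean, pstdev


def standardized_gain(before, after):
    """Mean gain expressed in baseline standard deviation units."""
    return (mean(after) - mean(before)) / pstdev(before)


# Hypothetical scores for the same group on two measures.
accountability_before = [40, 50, 60]
accountability_after = [50, 60, 70]
alternative_before = [40, 50, 60]
alternative_after = [42, 52, 62]

gain_on_system_test = standardized_gain(accountability_before, accountability_after)
gain_on_alternative = standardized_gain(alternative_before, alternative_after)

# Share of the accountability-test gain that also appears on the
# alternative indicator (a value near 1.0 would suggest the gain generalizes).
generalization_ratio = gain_on_alternative / gain_on_system_test
print(f"gain on accountability test: {gain_on_system_test:.2f} SD")
print(f"gain on alternative indicator: {gain_on_alternative:.2f} SD")
print(f"ratio: {generalization_ratio:.2f}")
```

In this invented example only about a fifth of the gain on the accountability test shows up on the alternative indicator, which is the kind of discrepancy an evaluation would need to investigate further.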